Abstract



Many features from texts and languages can now be inferred from statistical analyses using
concepts from complex networks and dynamical systems. In this paper we quantify how
topological properties of word co-occurrence networks and intermittency (or burstiness) in word
distribution depend on the style of authors. Our database contains 40 books from 8 authors
who lived in the 19th and 20th centuries, for which the following network measurements were
obtained: clustering coefficient, average shortest path lengths, and betweenness. We found that
the two factors with the strongest dependency on the authors were the skewness in the distribution
of word intermittency and the average shortest paths. Other factors such as the betweenness
and the Zipf's law exponent show only weak dependency on authorship. Also assessed was the
contribution from each measurement to authorship recognition using three machine learning
methods. The best performance was an accuracy of ca. 65% upon combining complex network
and intermittency features with the nearest neighbor algorithm. From a detailed analysis of
the interdependence of the various metrics it is concluded that the methods used here are
complementary for providing short- and long-scale perspectives of texts, which are useful for
applications such as identification of topical words and information retrieval.
1 Introduction



The application of ideas from statistical physics to text analysis has a long tradition since Shannon's
usage of entropy as the central concept in information theory [1]. In recent years, physicists have
proposed new approaches based on concepts from complex networks [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15] and dynamical systems [16, 17, 18, 19, 20]. In the former, text is represented as complex
networks with words (nodes) being connected (links) using procedures depending on their syntactic or
semantic relationships [2]. Several of these networks share topological properties such as the scaling
in the degree [3, 4] and the small world feature [6]. The co-occurrence networks, where adjacent
words are linked to each other, are probably the most popular for applications owing to their ability
to capture important syntactic and semantic aspects of texts with a straightforward construction
procedure. These networks were employed to evaluate writing quality [7] and machine translations
[8, 9], to generate and evaluate summaries [10], to construct spell checkers [11], to recognize patterns
in poetry [12] and prose [13, 14], and to study general properties of written language [15]. While
co-occurrence networks focus mainly on short scales, an increasingly popular approach addresses longer
text scales [T6], [13, [TEl EH E2J EE! El] . The usefulness of this latter approach stems from the finding 
that topical words are unevenly distributed along the text when compared to a random process 
or to function words. This observation can be quantitatively investigated using different analogies 
and measures familiar to the communities of statistical physics and dynamical systems, including 
level statistics [221 El], burstiness [T7J [TBI E3> entropy [21], and intermittency measures [211]. The
dependency of the features mentioned above on the author has been noticed [TU [T7], but little work has
been devoted to quantifying the extent of this dependency and to testing its usefulness for the automatic
detection of authors.

In the field of authorship recognition (or stylometry), one tries to identify the author of documents
whose authorship is unknown [22]. Some simple quantitative proposals, such as the use of word length to
distinguish between authors, go back to the mid 19th century (see Ref. [2S] for a historical account).
One important recent contribution was given by Mosteller and Wallace [2Z], who showed that the
frequency of function words (such as "any", "from", "an", "there" and "upon") can be used to
characterize the style of authors. This feature is so strong that even letter pair frequencies can
distinguish well between authors [26]. Frequent words are also responsible for the success
of the approach - proposed and investigated by physicists - that consists in quantifying the similarity
between two books based on the distance between their word-frequency rankings [28, 29, 30]. More
recently, new features have been proposed: word length, sentence length; frequency of punctuation
marks and contractions; frequency of graphemes, collocations and words. A summary of these recent
results is given in Ref. [31].

In this paper we investigate how the metrics of complex networks and intermittency - familiar to
physicists - depend on the style of authors. We start by quantifying the metrics for each word (Sec. [2]).
These are used in the definition of global features for each book, which are tested according to their
efficiency in algorithms of authorship classification (Sec. [3]). Finally, we discuss the importance of
each feature (Sec. [4]). The primary goal of this paper is not to improve state-of-the-art methods of
automatic authorship recognition. Instead, we wish to estimate the dependency on authorship of the
selected metrics. We evaluate our results using authorship classification tests because they provide a
statistically rigorous method to quantify the importance of different features. Nevertheless, our study
reveals insights which are potentially useful in real applications and therefore we include
a comparison to more traditional statistical natural language methods (that increase the efficiency
from 62.5% to 90.0% correct attributions, see Sec. 4.2).



2 Statistical quantification of the role of words in texts 
2.1 Database 

Our database contains 5 novels from each of 8 authors who lived between 1809 and 1975, which are
available in an online repository (http://www.gutenberg.org/). The list of books is given in
Sec. 1 of the Supplementary Information (SI). To avoid effects from the length of the texts, each
book was limited to its first 18,200 tokens, which corresponds to the length of the shortest book.
In the remainder of this section we illustrate our results using the book "The Adventures of Sally",
by P. G. Wodehouse. The results for all books appear in SI-Sec. 3 and are discussed in Sec. [3]
below.



Pre-processing of the text. 

Before extracting complex networks and intermittency measurements from the texts, some preprocessing
steps were applied. Initially, all words in a pre-compiled list of stopwords, including articles,
prepositions and adverbs, were removed from the text (see SI-Sec. 2). Previous work for recognizing
authorship used the frequency of function words, but we decided not to use them in our study because
we are interested in the interrelation between words with a pronounced semantic content. This
procedure has been employed in many works (see e.g. Refs. [7, 8, 9, 12, 15]) and it is crucial to
determine how these techniques depend on the writing style of each author. Next, a lemmatization step
was applied to the remaining words using the MXPost part-of-speech tagger based on Ratnaparkhi's
model [32]. Table [1] exemplifies the application of these pre-processing steps. With this
standardization, we grouped together all words referring to the same concept, despite differences in
inflection.



2.2 Network measurements 

Complex networks have been used to characterize different properties of languages [SI El El El QUI
HH [33]. Here we are interested in author-specific characteristics and therefore we adopt a network
description based on word co-occurrence [HI [71 El El [Til E3], where nodes are words and links are
established between subsequent words. This procedure is illustrated in Fig. [1]. The network is
defined by a set $V = \{v_1, v_2, \ldots, v_n\}$ of vertices and a set E of edges, and is represented as a
nonsymmetric weighted matrix W. By construction, W is a square matrix of size n, where n is the
number of distinct words after the pre-processing step. The elements $w_{ij}$ of W indicate the strength
with which $v_i$ is connected to $v_j$ ($v_i \to v_j$), i.e. the number of times word $v_j$ appears immediately
after word $v_i$. Additionally, we used the non-weighted and undirected network corresponding to W,
denoted by the matrix A whose elements $a_{ij} = 1$ if the words represented by the vertices $v_i$ and $v_j$
appeared as neighbors at least once in the text. Otherwise, $a_{ij} = 0$.

Table 1: Example of the pre-processing steps applied to the texts. An extract (first column)
obtained from the book "The Adventures of Sally", by Pelham Grenville Wodehouse, is shown after
the removal of the stopwords (second column) and after lemmatization (third column).

Original                           Without stopwords            After lemmatization
What's that ? asked Sally.         asked Sally                  ask Sally
Pay my bill for last week,         pay bill last week           pay bill last week
due this morning. Sally got up     morning Sally got            morning Sally get
quickly, and flitting down the     quickly flitting             quickly flit
table, put her arm round her       table put arm                table put arm
friend's shoulder and whispered    friend shoulder whispered    friend shoulder whisper
in her ear.                        ear                          ear
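The construction of W and A described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `build_cooccurrence`, the dictionary-based storage, the choice to link words across sentence boundaries, and the choice to skip self-loops in A are all assumptions made here.

```python
from collections import defaultdict


def build_cooccurrence(tokens):
    """Return (W, A) for a list of pre-processed tokens.

    W[u][v] counts how often word v appears immediately after word u
    (directed, weighted). A[u] is the set of words adjacent to u in
    either direction (undirected, unweighted).
    """
    W = defaultdict(lambda: defaultdict(int))
    A = defaultdict(set)
    for u, v in zip(tokens, tokens[1:]):
        W[u][v] += 1
        if u != v:  # assumption: self-loops are ignored in A
            A[u].add(v)
            A[v].add(u)
    return W, A


# Lemmatized tokens taken from the first rows of Table 1.
tokens = "ask Sally pay bill last week morning Sally get quickly flit".split()
W, A = build_cooccurrence(tokens)

# Out-strength and in-strength of "Sally"; both equal its frequency N = 2
# because "Sally" is neither the first nor the last token.
out_strength_sally = sum(W["Sally"].values())
in_strength_sally = sum(row.get("Sally", 0) for row in W.values())
```

In this toy input "Sally" occurs twice, and both strengths equal 2, in line with the relation between word frequency and node strength.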

We shall use the statistical properties of complex network measurements in the networks W and
A. In this section we discuss the word-specific local measurements and in Sec. 3.1 we show how to
connect them to obtain a global characterization of the network. The number of occurrences $N_i$ of
each word $i$ represented by node $v_i$ is (see footnote 1)

$$N_i = s_i^{\rm in} = \sum_j w_{ji} = s_i^{\rm out} = \sum_j w_{ij}, \qquad (1)$$

where $s^{\rm in}$ and $s^{\rm out}$ are respectively the weight (or strength [SUES]) of the ingoing and outgoing edges
of node $v_i$. Therefore, the degree of each node $v_i$ is the frequency of appearance of word $v_i$ and
the degree distribution of W is proportional to the normalized frequency ($f_i = N_i/N_t$, $N_t = \sum_i N_i$)
distribution of words, given by Zipf's law [35J EZ]. Below we discuss three typical measurements:
clustering coefficient, average shortest path length, and betweenness centrality.



Clustering Coefficient C. 

The clustering coefficient (C) measures the probability that the neighbors of a given vertex $v_i$ are
connected. This measurement has been widely employed in complex networks, e.g. to verify the
presence of communities [381 ESS SO] and to distinguish random networks from other small world
networks [51 HI]. Traditionally, the clustering coefficient is defined without considering weights or
directions as:

$$C = \frac{3 \sum_{k>j>i} a_{ij} a_{ik} a_{jk}}{\sum_{k>j>i} \left( a_{ij} a_{ik} + a_{ji} a_{jk} + a_{ki} a_{kj} \right)}, \qquad (2)$$

which is equivalent to the fraction of the number of triangles among all possible triads of connected
nodes, and therefore ranges from 0 to 1. With regard to the interpretation of this measure, Ferrer
i Cancho and Solé [5] found that the clustering coefficient of networks representing text was much
larger than the one expected just by chance (i.e., the value expected for the corresponding random
networks).

Footnote 1: The triple equality in Eq. (1) is not valid for the first and for the last word in the
text. The correct expression for the first word is $N_i = k^{\rm out}(i) = \sum_j w_{ij}$ and for the last word
it is $N_i = k^{\rm in}(i) = \sum_j w_{ji}$.

Figure 1: Example of networks: (a) the subgraph obtained for the sentences shown in Table 1 and
(b) the global network obtained from the first 1,000 associations of the same book.

We singled out the words (and their neighbors) with the highest and lowest values of clustering
coefficient in the book "The Adventures of Sally", by P. G. Wodehouse, which are shown in Table
[2] for the frequency N = 5 (see footnote 2). From the definition, one expects words with the highest C
to have neighbors that are also connected to each other. This is the case of the words "sand" and
"excitement". On the other hand, words whose neighbors are not related to each other at all display
low values of C (e.g. there is no link between the neighbors of "full" or between the neighbors of
"high"). Qualitatively, the clustering coefficient quantifies how words are connected to specific
contexts. Indeed, the words "sand" and "excitement" tend to be more restricted to a specific context,
while "full" tends to appear in a myriad of contexts. Therefore, it seems that the clustering
coefficient can be useful to detect authorship by quantifying the tendency of using semantic-specific
or generic words.
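To make the per-word interpretation concrete, the sketch below computes a clustering coefficient for a single node from the undirected adjacency A. It uses the standard local definition (the fraction of neighbor pairs that are themselves linked), which is an assumption here: the text does not spell out how the word-level values of Table 2 were obtained. The function name and the toy graph are hypothetical.

```python
from itertools import combinations


def local_clustering(A, node):
    """Fraction of pairs of neighbors of `node` that are linked to each other.

    A maps each node to the set of its neighbors (undirected, unweighted).
    """
    neighbors = A[node] - {node}
    k = len(neighbors)
    if k < 2:
        return 0.0  # no pair of neighbors exists
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in A[u])
    return linked / (k * (k - 1) / 2)


# Toy graph: a triangle a-b-c plus a pendant node d attached to a.
A = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
c_a = local_clustering(A, "a")  # one linked pair (b, c) out of three pairs
c_b = local_clustering(A, "b")  # its only pair of neighbors (a, c) is linked
```

A word such as "full", whose neighbors never co-occur with each other, would get 0 under this definition, as in the lower half of Table 2.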

Average Shortest Path Length L. 

A shortest path (or geodesic path) between two nodes is defined as the path whose sum of edge
weights is minimum. We start by defining $d_{ij}$ as the length of the shortest path between $v_i$ and $v_j$
(in this case A is employed). Then the average shortest (or geodesic) path length for $v_i$ ($L_i$) is the
average shortest path to all other (n-1) nodes of the network:

$$L_i = \frac{1}{n-1} \sum_{j \neq i} d_{ij}, \qquad (3)$$

which takes low values if $v_i$ is close to the other nodes.

Footnote 2: The words with $N_i < 5$ were considered to lack statistics and were disregarded in all the
analysis involving the clustering C.

Table 2: Words of the book "The Adventures of Sally" with the highest and lowest clustering
coefficients (the average clustering $\langle C \rangle = 0.085$), for words with $N_i = 5$. The five words with C = 0
were randomly selected among the 18 words with N = 5 and C = 0. Cells that could not be recovered
from the source are marked [illegible].

Word          Neighbors                                                                   N    C
[illegible]   [illegible], never, heaven, find, enter, Carmylle                           5    [illegible]
excitement    [illegible], can, come, bristle, brief, apart                               5    0.20
sand          [illegible], here, golden, first, dark, Roville                             5    0.18
[illegible]   [illegible], Sally, oh, glance, tell, come                                  5    0.18
[illegible]   [illegible], may, happen, great, glorious                                   5    0.18
[illegible]   [illegible], gather, first, everyday, displeased, considerably              5    0.00
high          [illegible], even, disposal, critical, collar, check                        5    0.00
[illegible]   [illegible], information, high, heavy, frame, buy                           5    0.00
gift          tongue, take, sort, potential, mean, few, easily, compensating,             5    0.00
              blessing, acquire
full          Tuesday, peal, later, home, happy, gratitude, gleaming, glance,             5    0.00
              color, battle

The words with the lowest L include the characters "Sally" (L = 2.35, N = 347) and "Fillmore"
(L = 2.51, N = 138), in addition to high-frequency words, such as "say" (L = 2.45, N = 349), "good"
(L = 2.46, N = 107) and "man" (L = 2.50, N = 193). As for the words with the highest L, we
found: "white-clad" (L = 6.33, N = 1), "affability" (L = 6.31, N = 1), "whirl" (L = 5.89, N = 1),
"jazz" (L = 5.87, N = 1), "war-aims" (L = 5.84, N = 1). Interestingly, all these 5 words appeared
only once in the text, indicating that one of the reasons for a high L could be the low frequency N.
However, L is not only a consequence of the frequency N of the words, as low frequency words can
also take low values of L. This is illustrated in Table [3], which compares words with the same N but
different L. The frequency has a limited influence on L, with a Pearson correlation Corr(L, N) =
-0.36 calculated over all words. Actually, the determining factor is the neighborhood of the word. To
understand why this happens, consider the words "affability" and "repose". While the former has
as neighbors the words "jaunty" (N = 1) and "white-clad" (N = 1), the latter has as neighbors the
words "Sally" (N = 347) and "night" (N = 20). Therefore, one may infer that L actually quantifies
the importance of a word according to its distance to the most frequent words. Since we removed
stopwords, the shortest path may be thought of as quantifying the distance from a word to the
core-content words of the book.
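Because the unweighted network A is used, the distances $d_{ij}$ can be obtained with a breadth-first search. The following sketch computes the average shortest path length of one node; the function name is hypothetical, and averaging only over reachable nodes is an assumption (a co-occurrence network built from one continuous text is connected, so this coincides with the average over the other n-1 nodes).

```python
from collections import deque


def average_shortest_path(A, source):
    """Mean breadth-first-search distance from `source` to all other reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in A[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others) if others else float("inf")


# Toy chain a - b - c - d: the end node is farther from the rest than an inner node.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
L_end = average_shortest_path(chain, "a")    # distances 1, 2, 3
L_inner = average_shortest_path(chain, "b")  # distances 1, 1, 2
```

The toy chain mirrors the "affability" versus "repose" comparison: a node attached only to peripheral nodes has a larger L than one attached to a well-connected node.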
Table 3: Comparing the average shortest path length L for words with the same frequency N of the
book "The Adventures of Sally". For a given N, L may vary widely, which shows the dependency of
L on the neighborhood connectivity.

Word          N    L        Word      N    L
red           5    3.71     earth     5    2.99
shudder       4    3.97     lucky     4    3.00
Maxwell       3    5.55     funny     3    3.10
dark          2    5.15     kiss      2    3.08
affability    1    6.34     repose    1    3.11



Table 4: Comparing the betweenness B for words of the book "The Adventures of Sally" with the
same frequency N. For a given N, the betweenness may vary widely, since low frequency words may
have high betweenness as they may appear in different contexts.

Word           N_i    B_i          Word         N_i    B_i
say            349    745,634      Sally        347    1,192,881
know           143    243,357      Fillmore     138    393,955
tell           65     53,904       Gerald       62     108,528
allow          20     15,816       Roville      21     32,449
heaven         10     1,147        second       10     22,004
rugger         5      855          worthy       5      10,503
fish           4      174          spectator    4      14,746
paper-knife    3      233          group        3      8,320
worship        2      44           sell         2      8,346
thaw           1      11           price        1      8,295



Betweenness. 

Betweenness B is a measurement of centrality, with higher values being assigned to the nodes
considered as the most relevant in terms of linking different words. In other words, with B one attempts
to quantify the frequency of access of each node, assuming that a given target node in the network is
reached from a specific source node via shortest paths. Betweenness is defined as follows. Let $\eta_{st}^{i}$ be
the number of distinct shortest paths between the source node $v_s$ and the target node $v_t$ that pass
through the node $v_i$. If $g_{st}$ is the total number of shortest paths between $v_s$ and $v_t$, then $B_i$ is given
by:

$$B_i = \sum_{s,t} \frac{\eta_{st}^{i}}{g_{st}}. \qquad (4)$$

In the context of text analysis, high frequency words tend to have high B. However, some
words may play the role of articulation points by linking concepts related to distinct communities.
To illustrate this, we show in Table [4] that words with similar N may take very different B. A
comparison between the left and right columns suggests that words with high B connect concepts
because of their probable appearance in various contexts. Therefore, similarly to the clustering
coefficient C, the betweenness centrality B seems to quantify the variety of contexts in which a word
can appear. Note, however, that B is based on a global connectivity pattern, in contrast to C.
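The definition of betweenness can be turned into code directly, at the cost of efficiency. The sketch below counts shortest paths with a breadth-first search from every node and then sums, over ordered pairs (s, t), the fraction of shortest paths passing through each node. Summing over ordered rather than unordered pairs is an assumption, as are the function names; for networks with thousands of words a faster algorithm such as Brandes' would be preferable.

```python
from collections import deque


def bfs_counts(A, s):
    """Distances from s and number of distinct shortest paths from s to each node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in A[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(A):
    """Unnormalized betweenness of every node, summed over ordered pairs (s, t)."""
    nodes = list(A)
    info = {s: bfs_counts(A, s) for s in nodes}
    B = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        dist_s, sigma_s = info[s]
        for t in nodes:
            if t == s or t not in dist_s:
                continue
            for i in nodes:
                if i == s or i == t or i not in dist_s:
                    continue
                dist_i, sigma_i = info[i]
                # i lies on a shortest s-t path iff d(s,i) + d(i,t) = d(s,t)
                if t in dist_i and dist_s[i] + dist_i[t] == dist_s[t]:
                    B[i] += sigma_s[i] * sigma_i[t] / sigma_s[t]
    return B


chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
B_chain = betweenness(chain)
B_square = betweenness(square)
```

In the chain, the inner nodes carry all traffic between the two sides, while in the square each node carries only half of the shortest paths between its two neighbors.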
Table 5: In the book "The Adventures of Sally", by P. G. Wodehouse, there are $N_t = 15,173$
words (tokens), 3,657 different word types, and 716 words with $N_i \geq 5$. The 5 words with highest
$\sigma_T/\overline{T}$ are shown in the left part of the table. For comparison, in the right we show for each of these
words another word with the closest frequency.

Word       N_i    I = sigma_T/T      Word      N_i    I = sigma_T/T
jules      26     4.31               turn      26     1.55
hobson     31     4.09               here      31     1.35
ginger     115    3.86               get       117    1.24
carmyle    54     3.60               feel      53     0.87
bunbury    20     3.59               people    20     1.39



2.3 Intermittency measurements

The uneven distribution of words across different documents is an essential feature exploited in
Statistical Natural Language Processing. For instance, by investigating words that are overconcentrated
in specific documents (when compared to their overall frequency) one can detect keywords,
topics, and authorship [271 H2]. This is the basic idea of the term frequency - inverse document
frequency (TF-IDF) and related measures that are also at the core of search engines [12]. However,
there are numerous situations where the comparison to a general database is not available or
is not interesting, for instance when authorship has to be attributed without previous knowledge
of texts written by the potential authors. Here we approach these problems by taking advantage
of the finding that words are unevenly distributed not only across documents but also within
them [161 EH CES [2D 1221 [221 121].

The quantification of the uneven distribution of words has been proposed based on measures
commonly used by physicists [121 [21]. Following Refs. [2U [221 [21], we use the statistics of recurrence
times, a standard quantification of intermittency or burstiness in time series [TH1 US]. In texts, time
is counted by the number of words and for each word $i$ the recurrence time $T_j$ is defined as the
number of words between two successive occurrences of $i$ (the $j$ and $j + 1$ occurrence) plus one. For
instance, the recurrence times for the word "the" in the previous sentence are $T_1 = 9$ and $T_2 = 7$.
A word that appears $N_i$ times in a text of size $N_t$ leads to a sequence of $N_i - 1$ inter-occurrence
times $\{T_1, T_2, \ldots, T_{N_i - 1}\}$. In order to incorporate also the time $T_f$ until the first occurrence and the
time $T_l$ after the last occurrence of the word, we consider $T_{N_i} = T_f + T_l$. In this case
$\overline{T} = N_t/N_i$, where the overline denotes the average over the different $T_j$'s. Note that the mean
recurrence time $\overline{T}$ gives no information beyond the frequency $N_i$. The intermittency of the word
appears in the variance of the $T_j$'s around $\overline{T}$ and can be quantified by $I = \sigma_T/\overline{T}$, where
$\sigma_T = \sqrt{\overline{T^2} - \overline{T}^2}$. Randomly distributed words in the text have I = 1 (in the limit of
large $N_i$ and small $N_i/N_t$), intermittent words have I > 1, and words appearing in regular intervals
have I < 1. We calculate the intermittency measure $I_i = \sigma_T/\overline{T}$ for all words with $N_i \geq 5$ in each of
the books (filtered texts) described in Sec. 2.1. The words with $N_i < 5$ were considered to lack
statistics and were disregarded.
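The recurrence-time statistics above translate directly into code. The sketch below is an illustration with hypothetical function names: it lists the recurrence times of a word, closes the sequence with the combined time before the first and after the last occurrence, and returns $I = \sigma_T/\overline{T}$. The example sentence is the one used in the text, lower-cased and stripped of punctuation by hand.

```python
import math


def recurrence_times(tokens, word):
    """Recurrence times of `word`; the last entry is T_f + T_l (wrap-around)."""
    positions = [k for k, tok in enumerate(tokens) if tok == word]
    if not positions:
        return []
    times = [b - a for a, b in zip(positions, positions[1:])]
    # time after the last occurrence plus time until the first occurrence
    times.append(len(tokens) - positions[-1] + positions[0])
    return times


def intermittency(tokens, word):
    """I = sigma_T / mean(T); about 1 for random, > 1 for bursty, < 1 for regular."""
    times = recurrence_times(tokens, word)
    if not times:
        raise ValueError("word does not occur in the text")
    mean = sum(times) / len(times)
    mean_sq = sum(t * t for t in times) / len(times)
    return math.sqrt(max(mean_sq - mean * mean, 0.0)) / mean


sentence = ("in texts time is counted by the number of words and for each word i "
            "the recurrence time tj is defined as the number").split()
times_the = recurrence_times(sentence, "the")

regular = ["x", "a", "a"] * 4          # "x" every third token
bursty = ["x"] * 4 + ["a"] * 8         # "x" concentrated at the start
I_regular = intermittency(regular, "x")
I_bursty = intermittency(bursty, "x")
```

By construction the recurrence times sum to the text length, so their mean equals $N_t/N_i$, as stated in the text.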

In Table [5] we compare words with the highest $I = \sigma_T/\overline{T}$ to words with similar frequency. It is clear
that the most intermittent words (largest $\sigma_T/\overline{T}$) are topical words (e.g., names of characters and
locations), regardless of their frequency. Indeed, 15 out of the 16 most intermittent words are
directly connected to specific characters. A similar behavior is observed in all books of our database.
Intermittency is therefore a good characterization of topical words, which in turn play an important
role in the author-specific characteristics of the texts. The relationship between $\sigma_T/\overline{T}$ and the function
of the words has been investigated in detail in Refs. [TBI El UHl ED 1221 [21]. In the next section we
explore the fact that these properties are also author specific [PT].

3 Evaluating the author dependency 

3.1 From properties of words to properties of books 

In the previous section we introduced five quantities characterizing properties of words in the text:
frequency (N), average shortest path length (L), betweenness (B), clustering coefficient (C), and
intermittency ($I = \sigma_T/\overline{T}$). The values of these quantities for all words in the books in our
database can be found in SI-Sec. 3. We now analyze the global distribution of these measurements
for all the words in a given book by plotting the empirical probability density function p(X) for the
measurements X = {N, L, B, C, I}. Fig. [2] shows the results for one book, and similar distributions
were obtained for the other books. The shortest path L, clustering C, and intermittency I have a well
defined peak and width (akin to a Gaussian distribution), but the frequency N and betweenness B
have broad-tailed distributions (as in power-law distributions $p(X) \sim X^{-\alpha}$). The tail in N corresponds
to the well-known Zipf's law, which also appears in B as expected from the large correlation between
B and N (Corr(B, N) = 0.95 in the book of Fig. [2]). Given these two different behaviors we propose
two sets of measurements, one for X = {L, C, I} and another for X = {N, B}.

Our goal is to obtain quantities characterizing important features of these distributions to be used
as global measurements of the books. The most natural choice is the average value $\langle X \rangle$, where
$\langle \ldots \rangle = \frac{1}{M} \sum_{i=1}^{M} \ldots$ corresponds to an average over the M different words. For the network
measures L, C, I this corresponds to the average values over nodes, a quantity considered as
characteristic of the network [TJ H2J [EH 133]. For X = {N, B}, the highly frequent words contribute
strongly to $\langle X \rangle$ due to the long tails. To compensate for this effect, we consider also a modified
average defined as $\langle X \rangle_2 = \langle \log X \rangle$ for X = {N, B}. For X = {L, C, I} the opposite is true, i.e., $\langle X \rangle$ is
dominated by the large number of low frequency words. Accordingly, we introduce a modified average as
$\langle X \rangle_2 \propto \sum_i X_i \log N_i$, i.e., a weighted average with weights proportional to the logarithm of the
frequency. The quantities $\langle X \rangle$ and $\langle X \rangle_2$ are expected to give a good account of "typical" values of X.
However, in Sec. [2] we mentioned that important information is conveyed by words with large X,
i.e., in the tails of the distributions shown in Fig. [2]. In order to characterize the fat-tailed
distributions of X = {N, B}, we used the coefficient $\alpha_X$ of a power-law fit to the tails of p(X) (see
footnote 3). An additional motivation for using this exponent comes from the suggestion in Ref. [15]
that it serves as a quantification of the style of texts. The large values of X = {L, C, I} were
characterized by calculating the skewness of p(X), a measure of the asymmetry of the distribution. In
summary, the three features we use for each of the five quantities X = {N, B, L, C, I} are:

$$\text{Average value:} \quad \langle X \rangle \quad \text{for } X = \{N, B, L, C, I\}. \qquad (5)$$

$$\text{Modified average:} \quad \langle X \rangle_2 = \begin{cases} \langle \log X \rangle & \text{for } X = \{N, B\}, \\ \langle X \log N \rangle / \langle \log N \rangle & \text{for } X = \{L, C, I\}. \end{cases} \qquad (6)$$

$$\text{Right tail:} \quad \gamma(X) = \begin{cases} \alpha \text{ in } p(X) \sim X^{-\alpha} & \text{for } X = \{N, B\}, \\ \mathrm{skewness}(X) = \left\langle \left( \frac{X - \langle X \rangle}{\sigma_X} \right)^3 \right\rangle & \text{for } X = \{L, C, I\}. \end{cases} \qquad (7)$$

Footnote 3: The fitting was performed to the cumulative distribution with logarithmic binning size, as
suggested in Ref. [44]. A cut-off $X > 3 \cdot 10^3$ was used for X = B (see Fig. 2(e)); no cut-off was used
for X = N.

These features are given in Fig. [2] for one book (see SI-Sec. 3 for all 40 books). The choice
of the quantities above is inevitably somewhat arbitrary. Our choice was intended to capture features
of the distribution, rather than giving a parametric description of the full distribution. In
particular, the power-law fit in Eq. (7) is not intended to fully describe the distributions, as is
apparent in Fig. [2](d,e).
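The book-level features of Eqs. (5)-(7) that do not require a fit can be computed with elementary statistics. The sketch below is illustrative: the function names are hypothetical, natural logarithms are assumed, all values passed to a logarithm are assumed positive, and the power-law exponent is left out because it depends on the fitting procedure of footnote 3.

```python
import math


def mean(values):
    """Plain average over the M words, Eq. (5)."""
    return sum(values) / len(values)


def log_average(x):
    """Modified average <log X> of Eq. (6), meant for X in {N, B}."""
    return mean([math.log(v) for v in x])


def log_weighted_average(x, n):
    """Modified average <X log N> / <log N> of Eq. (6), meant for X in {L, C, I}.

    `n` holds the word frequencies N_i that provide the weights log N_i.
    """
    weights = [math.log(k) for k in n]
    return sum(v * w for v, w in zip(x, weights)) / sum(weights)


def skewness(x):
    """Third standardized moment of Eq. (7), meant for X in {L, C, I}."""
    m = mean(x)
    sd = math.sqrt(mean([(v - m) ** 2 for v in x]))
    return mean([((v - m) / sd) ** 3 for v in x])


symmetric = skewness([1.0, 2.0, 3.0])            # symmetric sample
right_tailed = skewness([1.0, 1.0, 1.0, 5.0])    # one large value on the right
weighted = log_weighted_average([2.0, 4.0], [1, 10])
```

Words that occur once have weight log 1 = 0 in the weighted average, which is how the modified average reduces the influence of the many low-frequency words.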



3.2 Machine learning methods and evaluation 

In order to quantify the ability of the features described above to distinguish between authors, we
employ machine learning algorithms which induce classifiers from a training database. The robustness
of our results is tested with three widely used algorithms based on different principles. The first is
known as C4.5 [46], and generates decision trees based on the information gained by each feature;
the second algorithm is Naive Bayes [37], which is based on Bayes' theorem; and the third and
simplest algorithm is the Nearest Neighbor [3H], which classifies an unknown instance according to
the nearest neighbor of that instance in a normalized space involving all features. For more details,
see SI-Sec. 4.



3.3 Efficiency of the classification 

We consider the problem of distinguishing between 8 authors, using five books to represent each
author's style. More specifically, each book described in Sec. 2.1 was characterized by the set of
15 features discussed in Sec. 3.1 ($\langle X \rangle$, $\langle X \rangle_2$ and $\gamma(X)$ for X = {N, B, L, C, I}). The authorship
assignment was performed using the algorithms in Sec. 3.2 applied to a training dataset independent
of the test book using the cross-validation methodology (see SI-Sec. 4). This technique ensures that
the training and evaluation sets are different and it is equivalent to assigning the authorship of one
book in an experiment where 4 books of 8 authors were used as a training dataset. The final output
of the algorithms is the assignment of a specific author to each book tested, and the efficiency is
quantified simply as the fraction of successful assignments.
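The evaluation loop can be sketched for the simplest of the three classifiers, the nearest neighbor. The code below is an illustration under stated assumptions rather than the setup of SI-Sec. 4: features are rescaled to zero mean and unit variance over all books, Euclidean distance is used, and the fold assignment (one book per author in each fold, so that 4 books per author remain for training) is supplied by the caller. Function names and the toy data are hypothetical.

```python
import math


def normalize(rows):
    """Rescale every feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def nearest_neighbor_accuracy(features, authors, folds):
    """Fraction of books whose nearest neighbor outside their fold has the same author.

    folds[k] is the fold index of book k; books in the same fold as the
    test book are excluded from the training set.
    """
    X = normalize(features)
    hits = 0
    for k, x in enumerate(X):
        train = [j for j in range(len(X)) if folds[j] != folds[k]]
        best = min(train, key=lambda j: math.dist(x, X[j]))
        hits += authors[best] == authors[k]
    return hits / len(X)


# Toy data: two well-separated "authors" with three "books" each.
features = [[0.0, 1.0], [0.1, 1.1], [0.2, 0.9],
            [5.0, 3.0], [5.1, 3.2], [4.9, 2.9]]
authors = ["A", "A", "A", "B", "B", "B"]
folds = [0, 1, 2, 0, 1, 2]
accuracy = nearest_neighbor_accuracy(features, authors, folds)
```

With 8 authors, 5 books each, and folds numbered 0 to 4 within every author, the same function gives the fraction of successful assignments out of 40.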

The results are summarized in Table [6] and indicate accuracy rates between 42.5% and 50.0%
when all 15 features were used. These results are highly statistically significant (see the p-values
in Table [6]), confirming that these features successfully capture author-specific characteristics.
To further explore the accuracy of different methods, we considered cases in which only some of the
features were included in the algorithms. We tested all $2^{15} = 32,768$ combinations of the 15 features
and obtained a best result of 65.0% of correct assignments.

Figure 2: Probability density function p(X) obtained from the different words of the book "The
Adventures of Sally", by P. G. Wodehouse. (a) X = L shortest path, (b) X = C clustering coefficient,
(c) X = I intermittency, (d) X = N frequency, and (e) X = B betweenness. In (d) and (e) the
cumulative distribution $p(x > X) = \int_X^{\infty} p(x)\,dx$ is shown, with the density p(X) depicted in the
inset. The legends indicate the features defined in Eqs. (5)-(7) obtained for these distributions, and
Cor(X, N) indicates the Pearson correlation coefficient between X and N calculated over all words.

Table 6: Accuracy rate achieved for the three machine learning algorithms using all 15 features
and the best combination of these features. The accuracy is estimated based on 40 authorship
assignments. The p-values correspond to the probability of getting by chance a higher or equal
accuracy in one (all features) and in $2^{15} = 32,768$ (best case) trials. The features included in the
best cases can be found in SI-Tables S1-S4.

                   Decision Tree C4.5           Nearest neighbor kNN          Naive Bayes
All 15 features    50.0 % (p = 1 x 10^-8)       47.5 % (p = 6 x 10^-8)        42.5 % (p = 2 x 10^-6)
Best case          62.5 % (p = 5 x 10^-9)       65.0 % (p = 4 x 10^-10)       62.5 % (p = 5 x 10^-9)
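The order of magnitude of the p-values quoted in Table 6 can be checked with a binomial model: each of the 40 assignments is correct by chance with probability 1/8, and for the best case at least one of $2^{15}$ trials has to reach the given accuracy. Treating the $2^{15}$ trials as independent is a simplifying assumption of this sketch, and the function names are hypothetical.

```python
import math


def p_value(correct, trials=40, n_authors=8):
    """Probability of at least `correct` hits when guessing among n_authors."""
    q = 1.0 / n_authors
    return sum(math.comb(trials, k) * q ** k * (1.0 - q) ** (trials - k)
               for k in range(correct, trials + 1))


def p_value_best_of(correct, attempts, trials=40, n_authors=8):
    """Probability that at least one of `attempts` independent random runs does as well."""
    p = p_value(correct, trials, n_authors)
    # 1 - (1 - p)**attempts, written so that tiny p is not lost to rounding
    return -math.expm1(attempts * math.log1p(-p))


p_half = p_value(20)                    # 50.0 % of 40 books, single trial
p_best = p_value_best_of(26, 2 ** 15)   # 65.0 % of 40 books, best of 2^15
```

Under these assumptions the single-trial value for 20 correct assignments is of order $10^{-8}$ and the best-case value for 26 correct assignments is of order $10^{-10}$, consistent with the entries of the table.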

3.4 Relative importance of different features 

In evaluating the importance of the different features for the final results it is essential to
account for their mutual dependency. We start from the list of all $2^{15} = 32,768$ combinations of
features ordered by decreasing accuracy (as shown in SI-Tables S1-S4). We wish to quantify how often
feature y appears at the top of this list. To this end, we count the fraction of the $2^{14}$ feature
combinations that include y with accuracy higher than or equal to a threshold. The final estimate is
then given by the area under the curve of the ROC plot obtained by varying the threshold [19]. This
procedure is equivalent to the Mann-Whitney U test [50]. The motivation for using this method is that
it evaluates the importance of a specific feature by taking into account how it combines with the
other features to improve the accuracy of the prediction. The method depends both on the prediction
algorithm and on the other features.
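The area under the ROC curve used here equals the probability that a randomly chosen combination containing the feature has a higher accuracy than a randomly chosen combination without it, with ties counted as one half. The sketch below computes this quantity by direct pairwise comparison; the function name, the dictionary layout, and the toy accuracies are hypothetical, and for the full list of $2^{15}$ combinations a rank-based formula would be much faster.

```python
from itertools import combinations


def feature_auc(accuracy, feature):
    """Area under the ROC curve for `feature`.

    `accuracy` maps each feature combination (a frozenset) to its accuracy.
    Returns P(acc with feature > acc without) + 0.5 * P(tie).
    """
    with_f = [a for s, a in accuracy.items() if feature in s]
    without = [a for s, a in accuracy.items() if feature not in s]
    wins = sum((a > b) + 0.5 * (a == b) for a in with_f for b in without)
    return wins / (len(with_f) * len(without))


# Toy example with three features: only "x" improves the accuracy.
features = ["x", "y", "z"]
accuracy = {
    frozenset(c): (0.6 if "x" in c else 0.3)
    for r in range(len(features) + 1)
    for c in combinations(features, r)
}
auc_x = feature_auc(accuracy, "x")
auc_y = feature_auc(accuracy, "y")
```

A value near 1 marks a feature whose presence systematically pushes combinations to the top of the list, while a value near 0.5 marks a feature that makes no difference.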

The features were ranked based on the method described above. The results for the 3 prediction
algorithms are given in the first three columns of Table 7. The three features appearing as the
most prominent are $\langle N \rangle$ (average frequency), $\gamma(I)$ (skewness of intermittency), and $\langle N \rangle_2$ (average
logarithmic frequency). To assess the importance of features beyond a specific algorithm, it is
important to quantify to what extent the results obtained for the three algorithms (first 3 columns)
are consistent with each other. To this end we compute Spearman's rank correlation and obtain
the values 0.29 (p-value = 0.145), 0.49 (p-value = 0.032) and 0.67 (p-value = 0.003) for the pairs
C4.5/kNN, C4.5/Bayes and kNN/Bayes, respectively. The p-values are computed under the null
hypothesis that the rankings are independent. Altogether, the three p-values indicate that the three
rankings are consistent with each other. This is a strong indication that our analysis goes beyond
algorithm-specific results and indeed captures the influence of the features.
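
As a check on the rank correlations quoted above, the following sketch recomputes Spearman's coefficient from the three "multiple features" rank columns of Table 7, transcribed in the row order of that table. Since these columns contain no ties, the closed form $\rho = 1 - 6\sum_i d_i^2 / [n(n^2-1)]$ applies. The p-values are not recomputed here, and the variable names are illustrative.

```python
def spearman_no_ties(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Ranks of the 15 features (rows of Table 7, top to bottom).
c45 = [6, 2, 1, 7, 5, 10, 8, 12, 4, 3, 9, 11, 13, 15, 14]
knn = [1, 2, 6, 4, 8, 3, 7, 11, 13, 14, 9, 10, 5, 12, 15]
bayes = [1, 2, 3, 6, 5, 10, 8, 4, 11, 14, 9, 7, 12, 13, 15]

rho_c45_knn = spearman_no_ties(c45, knn)
rho_c45_bayes = spearman_no_ties(c45, bayes)
rho_knn_bayes = spearman_no_ties(knn, bayes)
print(round(rho_c45_knn, 2), round(rho_c45_bayes, 2), round(rho_knn_bayes, 2))
# prints: 0.29 0.49 0.67
```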

It is interesting to compare these results with evaluations that take each feature into account separately.
This can be done either by considering the accuracy of the prediction using only the specific feature
or by comparing the information gained by including the feature [51]. This last method has the
advantage of being independent of the prediction algorithm. These results are shown in the last 4
Table 7: Ranking of features based on the accuracy rate of the classifiers, where 1 in the table
means best, 2 second best and so on. The results for each classifier algorithm (C4.5, kNN and Bayes)
are reported using different ranking procedures: multiple features combined (Mann-Whitney U
test, columns 1-3), information gain (column 4) and accuracy using only one feature (columns 5-7).
The last column reports the Pearson correlation between each feature and the vocabulary size $M$
(number of different words) calculated over the 40 books in our database. The features in the table
are ordered by increasing geometric mean of the ranks obtained in the 3 multiple-feature
analyses, i.e., from most to least important (this ordering is the same as that obtained by considering
for each feature the likelihood of reaching by chance a ranking as good as the one in each of the three
ranking schemes). The areas under the curve in the multiple-feature analysis ranged between 56% and 69%.





                           Multiple features        Single feature                  Correlation
Feature                    C4.5    kNN    Bayes     Info    C4.5    kNN    Bayes    with M
$\langle N \rangle_2$         6      1        1        3       2      5        1     -0.90
$\gamma(I)$                   2      2        2       10      12      9       10     -0.08
$\langle N \rangle$           1      6        3        2       1      2        3     -0.96
$\langle L \rangle$           7      4        6        9       5      3        8      0.85
$\langle B \rangle$           5      8        5        1       3      1        2      0.98
$\langle I \rangle_2$        10      3       10       15      15     12       12     -0.34
$\langle L \rangle_2$         8      7        8        8       5      7        5      0.85
$\langle C \rangle$          12     11        4        6       5      5        5     -0.87
$\gamma(L)$                   4     13       11       13      10      9       13     -0.13
$\gamma(B)$                   3     14       14       11       8      9        9     -0.07
$\langle B \rangle_2$         9      9        9        7       9     14        5      0.88
$\langle C \rangle_2$        11     10        7        5       4      3        4     -0.87
$\langle I \rangle$          13      5       12       12      13     15       10     -0.29
$\gamma(N)$                  15     12       13        4      10      8       13      0.81
$\gamma(C)$                  14     15       15       14      14     12       15      0.07






Table 8: List of the 2 most relevant combinations of features, as revealed by a factorial analysis
for the C4.5, kNN and Bayes classifier algorithms. As expected from Table 7, $\gamma(I)$ provides good
results when used in conjunction with other features, such as $\langle N \rangle$, $\langle N \rangle_2$, $\gamma(B)$ and $\langle I \rangle_2$.





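                   C4.5                                                          kNN                                   Bayes
1st Combination    $\gamma(I)$, $\langle C \rangle$ and $\langle C \rangle_2$    $\gamma(I)$ and $\gamma(B)$           $\langle N \rangle$, $\langle N \rangle_2$ and $\langle B \rangle$
2nd Combination    $\gamma(I)$ and $\langle N \rangle$                           $\gamma(I)$ and $\langle N \rangle_2$ $\langle I \rangle_2$ and $\gamma(I)$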



columns of Table 7. Note that some features appearing as very important in the multiple-feature
analysis are not informative when taken alone (e.g., the skewness of the intermittency $\gamma(I)$). On the
other hand, features that are well ranked in the single-feature analysis do not always appear among
the most important features when multiple features are considered (e.g., the weighted average of the
clustering $\langle C \rangle_2$). These observations show the nontrivial mutual dependency of the features. To
explore this further we performed a factorial analysis (see SI-Sec. 5) using the 12 most important
features in Table 7, with the most important combinations summarized in Table 8. As expected
from Table 7, $\gamma(I)$ appears among the 2 best combinations of features in all three algorithms,
which confirms that its effectiveness is tied to its interdependence with other features.



4 Discussion and conclusions 



4.1 Interpretation of the results 



We are now in a position to use the word-specific analysis (Sec. 2) and the distribution (Sec. 3.1) of
the quantities $X = \{N, B, L, C, I\}$ to assess the importance of the different features $\langle X \rangle$, $\langle X \rangle_2$, and
$\gamma(X)$ in Table 7.

N frequency. This was the most efficient quantity for recognizing authorship, with $\langle N \rangle$ and $\langle N \rangle_2$
among the 3 most important features. Noting that $\langle N \rangle$ is inversely proportional to the number
of distinct words $M$:

$$\langle N \rangle = \frac{\text{length of book}}{M},$$

one infers that the distinguishing feature between the authors captured by $\langle N \rangle$ is the different
vocabulary sizes. The modified average $\langle N \rangle_2 = \langle \log N \rangle$ also captures this aspect, including
the proportion of frequent and infrequent words. On the other hand, the poor performance of
$\gamma(N)$ ($= \alpha$ in $p(N) \sim N^{-\alpha}$) is a clear signature of the universal, author-independent character
of Zipf's law (at fixed book size).



B betweenness. The average betweenness $\langle B \rangle$ was useful because of its strong correlation with the
vocabulary size of the book $M$ (last column in Table 7). In network terms, this corresponds to a
linear relationship between $\langle B \rangle$ and network size ($M$, number of nodes) and can be understood
by noting that the number of terms in the sum defining $B_i$ is proportional
to $M^2$, so that $\langle B \rangle$ is expected to scale linearly with $M$ for a fixed book size. The fact that $\langle B \rangle_2$
and $\gamma(B)$ performed poorly indicates that the number of words with high betweenness
is not a relevant feature to distinguish between authors.



L shortest path. This was the network quantity with best performance. $\langle L \rangle$ quantifies the
typical distance of words to the central hubs of the network (frequent words). The good
performance points to a dependence on the style of the authors. The poorer performance
of $\langle L \rangle_2$ and $\gamma(L)$ indicates that the style dependency in $L$ is more prominent in the typical
values than, respectively, in the frequent and large-$L$ words.

C clustering. The poor performance of all values related to this quantity suggests that authors
have very little freedom in choosing the clustering of word co-occurrence networks. The last
position in the ranking of $\gamma(C)$ in Table 7 suggests that the fraction of words used in specific
contexts (high $C$) is author independent. The two averages $\langle C \rangle_2$ and $\langle C \rangle$ take similar values
(as seen in Fig. 2 and in SI-Sec. 3; recall the restriction $N_i > 5$ used in Sec. 2.2). They perform
well only when used alone, possibly because of their correlation with vocabulary size $M$.

I intermittency. Apart from the frequency, intermittency was the most important quantity in
Table 7, with the skewness of the distribution $\gamma(I)$ playing a prominent role. In view of the
results in Sec. 2.3, $\gamma(I)$ may be interpreted as the fraction of all words that are topical or
"keyword-like". The poor performance of $\langle I \rangle$ is not surprising since $I$ is normalized by frequency
($I = \sigma_T/\overline{T}$) and therefore $\langle I \rangle \approx 1$ is expected. Indeed, of all 5 quantities the $I$-features
have shown altogether the smallest absolute value of correlation with vocabulary size (Table 7),
explaining why even $\gamma(I)$ has a poor relative performance when used alone. Finally, $\langle I \rangle_2$
performs better than $\langle I \rangle$, suggesting that frequent words are the more relevant ones.
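
The frequency and intermittency quantities discussed in the list above can be illustrated with a minimal sketch. It assumes that the text is already tokenized and preprocessed, that $\langle N \rangle_2$ is the mean of the natural logarithm of the word frequencies, and that the intermittency of a word is the ratio of the standard deviation to the mean of its recurrence times (the gaps between successive occurrences). Boundary conditions and other details of Sec. 2.3 are not reproduced, and the function names are hypothetical.

```python
import math
import statistics
from collections import Counter


def frequency_features(tokens):
    """Return (M, <N>, <log N>) for a list of word tokens."""
    counts = Counter(tokens)
    M = len(counts)                                    # vocabulary size
    mean_N = sum(counts.values()) / M                  # equals len(tokens) / M
    mean_log_N = sum(math.log(n) for n in counts.values()) / M
    return M, mean_N, mean_log_N


def intermittency(tokens, word):
    """Standard deviation over mean of the recurrence times of `word`."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three occurrences of the word")
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


tokens = "the cat saw the dog and the dog saw the cat".split()
M, mean_N, mean_log_N = frequency_features(tokens)

periodic = ["w", "x", "x"] * 10                  # "w" recurs every 3 tokens
bursty = ["w"] * 4 + ["x"] * 96 + ["w"] * 4      # two tight clusters of "w"
I_periodic = intermittency(periodic, "w")
I_bursty = intermittency(bursty, "w")
```

A perfectly regular word gives $I = 0$ and a clustered, keyword-like word gives $I > 1$. The identity $\langle N \rangle = \text{length}/M$ holds by construction, which is why $\langle N \rangle$ mainly reflects vocabulary size at fixed book length.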



4.2 Comparison with other prediction methods 

Even if the main goal of this paper is to evaluate the importance of different factors, it is also useful
to compare the accuracy of our results with other methods of authorship attribution. Uzuner and
Katz [52] used a database of books similar to ours, produced by 8 authors. They used five sets of
features, including simple statistics and more sophisticated syntactic analysis (Table 3 of Ref. [52]).
Our best result (accuracy of 65%) is comparable to their second-best case, obtained using "syntactic
elements of expression" (62%), being significantly worse only than their best result, achieved using
function words (87%). In an extensive review, Grieve [31] reported accuracies obtained with a set
of 34 features varying between 33% and 87% for the case of 5 authors, and between 18% and 80%
for 10 authors (Table 9 in Ref. [31]). Our best results are above the median of their results achieved
by using different features. Their best results again are based on the relative frequency of function
words. These results are in accordance with the long tradition started by Mosteller and Wallace of
using the frequency of function words to distinguish between authors [27].

In order to confirm this in our database, we implemented a series of prediction schemes using the
frequency $N_i$ of frequent (mostly function) words. Unlike the approach described in this
paper, which used average and scaling properties as features, here the frequencies of specific words are
used directly as input features of the prediction algorithms. We used only the kNN algorithm because
the other algorithms did not provide good results when too many features were included. When the
list of 70 stopwords from Table 2.5 of Ref. [27] was used, we obtained an accuracy of 62.5%, i.e.,
comparable to our best results. Following Ref. [31], we considered two other lists of words: all 1,978
words that appear in at least one book of each of the authors, leading to an accuracy of 90%; and all 209
words that appear at least once in every book in our database, leading to an accuracy of 82.5%. We
recall that, in order to concentrate on analyses that focus on words with pronounced semantic content
instead of function words, we deliberately excluded a list of stopwords that comprised 80%
of the cases listed in Ref. [27]. Therefore, our best combination of features compares well to other
methods which demand more sophisticated syntactic analysis of the text. Measurements of complex
networks and intermittency are indeed able to capture many of the author-dependent characteristics.
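
The word-frequency scheme described above can be sketched as a nearest-neighbor classifier on relative word frequencies. The value of $k$, the distance measure and the validation procedure used in the paper are not stated in this section, so the sketch assumes $k = 1$, Euclidean distance and leave-one-out validation. The four miniature "books" and the two-word list are invented purely for illustration.

```python
import math
from collections import Counter


def freq_vector(tokens, word_list):
    """Relative frequency of each word of word_list in a tokenized book."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in word_list]


def loo_1nn_accuracy(vectors, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i, v in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        nearest = min(others, key=lambda j: math.dist(v, vectors[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)


# Hypothetical toy corpus: author A favors "the", author B favors "of".
books = [
    ("A", "the the the of a".split()),
    ("A", "the the of the the a".split()),
    ("B", "of of the of a".split()),
    ("B", "of of of the a of".split()),
]
word_list = ["the", "of"]
labels = [author for author, _ in books]
vectors = [freq_vector(tokens, word_list) for _, tokens in books]
accuracy = loo_1nn_accuracy(vectors, labels)
```

With a real corpus the word list would be one of the lists mentioned above (for example the 70 stopwords of Ref. [27]), and each book would contribute one frequency vector.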

In order to illustrate how the measures analyzed in this paper can be complementary to traditional
methods, we performed a very simple experiment using as features the frequency and intermittency
of the set of words composed of the five most frequent words in each book. The accuracy
in classifying the books by frequency alone was 72.5% and by intermittency alone was 37.5%.
Although this last accuracy rate is not impressive, it is statistically significant ($p = 2.2 \times 10^{-4}$) and
shows that the intermittency values of specific words differ across distinct authors. The accuracy
increased to 80% when both features were included. It remains to be shown in future work
how our results can improve state-of-the-art methods of authorship attribution.

4.3 Summary of conclusions 

We have shown that the style of different authors leaves fingerprints in very general statistical measures
of texts based on the network of co-occurrence of words and on the intermittency or burstiness of words.
The statistically significant scores obtained in authorship attribution unequivocally show that the
style dependence of these features can be used in practice. Regarding the prominence of the different
features, we note that both the results and the ranking of features may depend on the database, selected
features and attribution algorithms. Accordingly, as emphasized in Ref. [31], different algorithms and
features have to be tested in a given corpus before any real application of authorship attribution.
However, the robustness of our results using three radically different attribution algorithms strongly
suggests that the different features have an importance that goes beyond specific algorithms. Two features
should be highlighted: (i) the skewness of the distribution of intermittent words $\gamma(I)$, which is based
on the long-scale distribution of words and detects the extent to which topical words (keywords,
large $I = \sigma_T/\overline{T}$) were used in the book; and (ii) the mean shortest path of the word co-occurrence
network $\langle L \rangle$, which is based on the short-range connectivity of words and detects the typical distance
of words to all other words. The different natures of these two quantities suggest a complementary
role for capturing both short- and long-scale properties of the text, as well as typical and exceptional
words.

Our focus in this paper was on the evaluation of the different features, rather than on maximizing
the efficiency of the authorship attribution algorithms. This is apparent when comparing the best
accuracy rates we achieved using our approach (62.5%) and using previous proposals (90.0%), as
discussed in Sec. 4.2. A further limitation of approaches based on intermittency and networks
is that they require large pieces of text. While the root of the success of previous methods relies
on the observation that function words are a powerful tool to detect the style of authors [27, 31,
52], in the complex network and intermittency approaches used in this paper the focus is on the
content words. In this sense the results we achieve can be thought of as being complementary to
the analysis using function words. More specifically, our results suggest that using $\gamma(I)$ and $\langle L \rangle$ can
improve authorship recognition techniques when used in combination with the many different features
currently employed [31]. Finally, the successful application of these measurements to characterize the
style of authors suggests that the quantities discussed here can be further explored in other linguistic
tasks, an approach that has been limited to a few works (see e.g. [14, 55]).